import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%config InlineBackend.figure_format = 'svg'
import warnings
warnings.filterwarnings('ignore')
from plotnine import *
from google.colab import drive,files
drive.mount('/content/gdrive')
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
The customer base dataset used in this work is made available by IBM and downloaded from Kaggle. It is related to an anonymous telecom company and contains 7043 customers data with 21 attributes where each row represents a customer and each column contains customer’s attributes. Link for the dataset : https://www.kaggle.com/blastchar/telco-customer-churn
df=pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/HW_DAY13_Indra/TChurn.csv')
df
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | 6840-RESVB | Male | 0 | Yes | Yes | 24 | Yes | Yes | DSL | Yes | No | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.5 | No |
| 7039 | 2234-XADUH | Female | 0 | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.9 | No |
| 7040 | 4801-JZAZL | Female | 0 | Yes | Yes | 11 | No | No phone service | DSL | Yes | No | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No |
| 7041 | 8361-LTMKD | Male | 1 | Yes | No | 4 | Yes | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.6 | Yes |
| 7042 | 3186-AJIEK | Male | 0 | No | No | 66 | Yes | No | Fiber optic | Yes | No | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.5 | No |
7043 rows × 21 columns
from pandas_profiling import ProfileReport
pandas_profiling_report = df.profile_report(
title="Pandas Profiling Report",
explorative=True, #More analysis (Unicode)
html={'style':
{'full_width': True,
'theme':'flatly'} #available theme options: ‘bootswatch’ ,'flatly','united'
},
interactions = {'targets' : ['Churn']} , #When a list of variable names is given, only interactions between these and all other variables are given.
progress_bar=True)
pandas_profiling_report.to_file("pandas_profiling_report.html")
row_n,col_n=df.shape
print(f'Datasets consist of {row_n} rows and {col_n} columns')
Datasets consist of 7043 rows and 21 columns
Let's describe each feature shortly:
The analysis is carried out to answer this problem :
Who are the customers more likely to churn and what are the probable reasons behind that decision?
What actions can be taken to prevent them from leaving?
df_desc=df.isna().sum(axis=0).reset_index()
df_desc.columns=['Variable','Missing_values']
df_desc['Filling_Values (%)']=(row_n-df_desc['Missing_values'])*100/row_n
df_desc['Dtypes']=df.dtypes.values
df_desc
| Variable | Missing_values | Filling_Values (%) | Dtypes | |
|---|---|---|---|---|
| 0 | customerID | 0 | 100.0 | object |
| 1 | gender | 0 | 100.0 | object |
| 2 | SeniorCitizen | 0 | 100.0 | int64 |
| 3 | Partner | 0 | 100.0 | object |
| 4 | Dependents | 0 | 100.0 | object |
| 5 | tenure | 0 | 100.0 | int64 |
| 6 | PhoneService | 0 | 100.0 | object |
| 7 | MultipleLines | 0 | 100.0 | object |
| 8 | InternetService | 0 | 100.0 | object |
| 9 | OnlineSecurity | 0 | 100.0 | object |
| 10 | OnlineBackup | 0 | 100.0 | object |
| 11 | DeviceProtection | 0 | 100.0 | object |
| 12 | TechSupport | 0 | 100.0 | object |
| 13 | StreamingTV | 0 | 100.0 | object |
| 14 | StreamingMovies | 0 | 100.0 | object |
| 15 | Contract | 0 | 100.0 | object |
| 16 | PaperlessBilling | 0 | 100.0 | object |
| 17 | PaymentMethod | 0 | 100.0 | object |
| 18 | MonthlyCharges | 0 | 100.0 | float64 |
| 19 | TotalCharges | 0 | 100.0 | object |
| 20 | Churn | 0 | 100.0 | object |
Based on above dataframe, it appears that :
#Check all values of TotalCharges
df.sort_values(by='TotalCharges').head(15)
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 936 | 5709-LVOEQ | Female | 0 | Yes | Yes | 0 | Yes | No | DSL | Yes | Yes | Yes | No | Yes | Yes | Two year | No | Mailed check | 80.85 | No | |
| 3826 | 3213-VVOLG | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.35 | No | |
| 4380 | 2520-SGTTA | Female | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.00 | No | |
| 753 | 3115-CZMZD | Male | 0 | No | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.25 | No | |
| 5218 | 2923-ARZLG | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | One year | Yes | Mailed check | 19.70 | No | |
| 3331 | 7644-OMVMY | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 19.85 | No | |
| 6754 | 2775-SEFEE | Male | 0 | No | Yes | 0 | Yes | Yes | DSL | Yes | Yes | No | Yes | No | No | Two year | Yes | Bank transfer (automatic) | 61.90 | No | |
| 6670 | 4075-WKNIU | Female | 0 | Yes | Yes | 0 | Yes | Yes | DSL | No | Yes | Yes | Yes | Yes | No | Two year | No | Mailed check | 73.35 | No | |
| 1340 | 1371-DWPAZ | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | Yes | Yes | Yes | Yes | No | Two year | No | Credit card (automatic) | 56.05 | No | |
| 488 | 4472-LVYGI | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | No | Yes | Yes | Yes | No | Two year | Yes | Bank transfer (automatic) | 52.55 | No | |
| 1082 | 4367-NUYAO | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.75 | No | |
| 105 | 6180-YBIQI | Male | 0 | No | No | 5 | No | No phone service | DSL | No | No | No | No | No | No | Month-to-month | No | Mailed check | 24.30 | 100.2 | No |
| 4459 | 3178-FESZO | Female | 0 | No | No | 1 | Yes | Yes | Fiber optic | No | Yes | No | No | Yes | Yes | Month-to-month | Yes | Credit card (automatic) | 100.25 | 100.25 | Yes |
| 1723 | 2587-YNLES | Female | 0 | Yes | Yes | 6 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.10 | 100.35 | No |
| 2124 | 7802-EFKNY | Male | 0 | Yes | No | 5 | No | No phone service | DSL | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 24.95 | 100.4 | Yes |
After sorting the TotalCharges, we can clearly see that the roots of error when converting the datatypes are ' ' characters. By looking at the tenure columns, we find out that these anomalies may appear because the tenure values is 0 and one month period hasn't been completed yet or maybe there are free trials. Since this anomaly only appears in 11 rows (only 0.1% of total rows), these rows will be removed.
df[df.TotalCharges==' '] #The rows that will be removed
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 488 | 4472-LVYGI | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | No | Yes | Yes | Yes | No | Two year | Yes | Bank transfer (automatic) | 52.55 | No | |
| 753 | 3115-CZMZD | Male | 0 | No | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.25 | No | |
| 936 | 5709-LVOEQ | Female | 0 | Yes | Yes | 0 | Yes | No | DSL | Yes | Yes | Yes | No | Yes | Yes | Two year | No | Mailed check | 80.85 | No | |
| 1082 | 4367-NUYAO | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.75 | No | |
| 1340 | 1371-DWPAZ | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | Yes | Yes | Yes | Yes | No | Two year | No | Credit card (automatic) | 56.05 | No | |
| 3331 | 7644-OMVMY | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 19.85 | No | |
| 3826 | 3213-VVOLG | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.35 | No | |
| 4380 | 2520-SGTTA | Female | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.00 | No | |
| 5218 | 2923-ARZLG | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | One year | Yes | Mailed check | 19.70 | No | |
| 6670 | 4075-WKNIU | Female | 0 | Yes | Yes | 0 | Yes | Yes | DSL | No | Yes | Yes | Yes | Yes | No | Two year | No | Mailed check | 73.35 | No | |
| 6754 | 2775-SEFEE | Male | 0 | No | Yes | 0 | Yes | Yes | DSL | Yes | Yes | No | Yes | No | No | Two year | Yes | Bank transfer (automatic) | 61.90 | No |
df=df[df.TotalCharges!=' '] #the problematic rows have been removed
df['TotalCharges']=df['TotalCharges'].apply(lambda x:float(x))
df
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | 6840-RESVB | Male | 0 | Yes | Yes | 24 | Yes | Yes | DSL | Yes | No | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.50 | No |
| 7039 | 2234-XADUH | Female | 0 | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.90 | No |
| 7040 | 4801-JZAZL | Female | 0 | Yes | Yes | 11 | No | No phone service | DSL | Yes | No | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No |
| 7041 | 8361-LTMKD | Male | 1 | Yes | No | 4 | Yes | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.60 | Yes |
| 7042 | 3186-AJIEK | Male | 0 | No | No | 66 | Yes | No | Fiber optic | Yes | No | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.50 | No |
7032 rows × 21 columns
#Re-checking the datatypes and missing values
df_desc=df.isna().sum(axis=0).reset_index()
df_desc.columns=['Variable','Missing_values']
df_desc['Filling_Values (%)']=(row_n-df_desc['Missing_values'])*100/row_n
df_desc['Dtypes']=df.dtypes.values
df_desc
| Variable | Missing_values | Filling_Values (%) | Dtypes | |
|---|---|---|---|---|
| 0 | customerID | 0 | 100.0 | object |
| 1 | gender | 0 | 100.0 | object |
| 2 | SeniorCitizen | 0 | 100.0 | int64 |
| 3 | Partner | 0 | 100.0 | object |
| 4 | Dependents | 0 | 100.0 | object |
| 5 | tenure | 0 | 100.0 | int64 |
| 6 | PhoneService | 0 | 100.0 | object |
| 7 | MultipleLines | 0 | 100.0 | object |
| 8 | InternetService | 0 | 100.0 | object |
| 9 | OnlineSecurity | 0 | 100.0 | object |
| 10 | OnlineBackup | 0 | 100.0 | object |
| 11 | DeviceProtection | 0 | 100.0 | object |
| 12 | TechSupport | 0 | 100.0 | object |
| 13 | StreamingTV | 0 | 100.0 | object |
| 14 | StreamingMovies | 0 | 100.0 | object |
| 15 | Contract | 0 | 100.0 | object |
| 16 | PaperlessBilling | 0 | 100.0 | object |
| 17 | PaymentMethod | 0 | 100.0 | object |
| 18 | MonthlyCharges | 0 | 100.0 | float64 |
| 19 | TotalCharges | 0 | 100.0 | float64 |
| 20 | Churn | 0 | 100.0 | object |
The TotalCharges column is already detected as float64 variable
df.drop('customerID',axis=1,inplace=True) #drop the id column since all its values are unique
df
| gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | Male | 0 | Yes | Yes | 24 | Yes | Yes | DSL | Yes | No | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.50 | No |
| 7039 | Female | 0 | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.90 | No |
| 7040 | Female | 0 | Yes | Yes | 11 | No | No phone service | DSL | Yes | No | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No |
| 7041 | Male | 1 | Yes | No | 4 | Yes | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.60 | Yes |
| 7042 | Male | 0 | No | No | 66 | Yes | No | Fiber optic | Yes | No | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.50 | No |
7032 rows × 20 columns
fig,ax=plt.subplots(1,3)
ax[0].boxplot(df.MonthlyCharges)
ax[0].set_xlabel('MonthlyCharges')
ax[0].set_xticklabels([])
ax[0].set_xticks([])
ax[1].boxplot(df.TotalCharges)
ax[1].set_xlabel('TotalCharges')
ax[1].set_xticklabels([])
ax[1].set_xticks([])
ax[2].boxplot(df.tenure)
ax[2].set_xlabel('tenure')
ax[2].set_xticklabels([])
ax[2].set_xticks([])
fig.text(0.12,.95,'Boxplot of numerical variable within the dataset',fontweight='bold',ha='left',fontsize=15)
fig.text(0.12,.9,'No outlier detected for all numerical variables',ha='left',fontsize=13)
fig.set_size_inches(12,6)
plt.savefig('outlier1.jpg',dpi=360,bbox_inches='tight')
#files.download("outlier1.jpg")
First, let's create a function to calculate count and percentage of each category within one categorical variable
def count_with_percent(var,df=df):
nrow,ncol=df.shape
df_agg=df.groupby(by=var)[var].agg(['count']).reset_index()
df_agg['percentage']=df_agg['count']*100/nrow
return df_agg
df_agg_churn=count_with_percent('Churn')
fig,ax=plt.subplots(figsize=(4,4))
#palet=sns.color_palette("OrRd",2)
sns.barplot(data=df_agg_churn,x='Churn',y='count',ax=ax)
for x,y,z in zip(df_agg_churn.index,df_agg_churn['count'],df_agg_churn.percentage):
ax.annotate(str(round(z,2))+'%',(x,0.5*y),ha='center',weight='bold')
fig.text(0.12,.95,'Count of Churn Variable',fontweight='bold',ha='left',fontsize=11)
fig.text(0.12,.9,'26.58% percent of customers prefer to churn',ha='left',fontsize=10)
plt.savefig('churn_univariate.jpg',dpi=360,bbox_inches='tight')
files.download("churn_univariate.jpg")
It can be seen that there is an imbalance for the target variable since it contains 5163 rows of No entries (73.42\%) and 1869 rows of Yes entries (26.58\%), indicates that the corresponding company has 26.57\% churn rate within the last month
#reconvert the SeniorCitizen value into Yes,No
df['SeniorCitizen']=df['SeniorCitizen'].apply(lambda x: 'Yes' if x==1 else 'No')
df
| gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | No | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | No | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | Male | No | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | No | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | Female | No | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | Male | No | Yes | Yes | 24 | Yes | Yes | DSL | Yes | No | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.50 | No |
| 7039 | Female | No | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.90 | No |
| 7040 | Female | No | Yes | Yes | 11 | No | No phone service | DSL | Yes | No | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No |
| 7041 | Male | Yes | Yes | No | 4 | Yes | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.60 | Yes |
| 7042 | Male | No | No | No | 66 | Yes | No | Fiber optic | Yes | No | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.50 | No |
7032 rows × 20 columns
Some function will be made to create the countplot more easily
def bar_count_plot(axess,item_var,df=df):
nrow,ncol=df.shape
df_agg=df.groupby(by=item_var)[item_var].agg(['count']).reset_index()
df_agg['percentage']=df_agg['count']*100/nrow
#palet=sns.color_palette("OrRd",df_agg.shape[0])
sns.barplot(data=df_agg,x=item_var,y='count',ax=axess)
for x,y,z in zip(df_agg.index,df_agg['count'],df_agg.percentage):
axess.annotate(str(round(z,2))+'%',(x,0.5*y),ha='center',weight='bold')
#fig.text(0.12,.9,'Count of '+item_var,fontweight='bold',ha='left',fontsize=11)
axess.set_title(item_var)
axess.set_xlabel('')
return axess
demo_var=['gender','SeniorCitizen','Partner','Dependents']
fig,ax=plt.subplots(2,2,figsize=(10,7))
ax[0,0]=bar_count_plot(axess=ax[0,0],item_var=demo_var[0])
ax[0,1]=bar_count_plot(axess=ax[0,1],item_var=demo_var[1])
ax[1,0]=bar_count_plot(axess=ax[1,0],item_var=demo_var[2])
ax[1,1]=bar_count_plot(axess=ax[1,1],item_var=demo_var[3])
fig.text(0.12,.93,'Customer Demographic Attributes',fontweight='bold',ha='left',fontsize=13)
plt.savefig('customer_demo.jpg',dpi=360,bbox_inches='tight')
files.download("customer_demo.jpg")
One can find that within the sample, the amount of each gender is approximately equal. The same applies to the Partner variable as the number of customers with or without a partner is also roughly similar. The distribution of imbalances can be seen by other features that around 16\% of customers are considered elderly and 30\% of customers have dependents.
demo_var=['Contract','PaperlessBilling','PaymentMethod']
fig,ax=plt.subplots(2,2,figsize=(10,7))
ax[0,0]=bar_count_plot(axess=ax[0,0],item_var=demo_var[0])
ax[0,1]=bar_count_plot(axess=ax[0,1],item_var=demo_var[1])
ax[1,0]=bar_count_plot(axess=ax[1,0],item_var=demo_var[2])
for label in ax[1,0].get_xticklabels():
label.set_ha("center")
label.set_rotation(10)
ax[1,1].axis('off')
fig.text(0.12,.93,'Customer Account Attributes',fontweight='bold',ha='left',fontsize=13)
plt.savefig('customer_acc.jpg',dpi=360,bbox_inches='tight')
files.download("customer_acc.jpg")
Contract, PaperlessBilling, and Payment method give informations about customer account and transaction method. From above figure, it is obvious that most customers prefer a month-to-month contract rather than a yearly contract. Moreover, for the transaction method, most customers opting for paperless billing and using electronic check.
demo_var=['PhoneService','MultipleLines','InternetService','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']
fig,ax=plt.subplots(3,3,figsize=(12,12))
k=0
for i in range(3):
for j in range(3):
ax[i,j]=bar_count_plot(axess=ax[i,j],item_var=demo_var[k])
k=k+1
if j>0 :
ax[i,j].set_ylabel('')
#ax[1,1].axis('off')
fig.text(0.12,.91,'Customer Service Attributes',fontweight='bold',ha='left',fontsize=13)
plt.savefig('customer_service.jpg',dpi=360,bbox_inches='tight')
files.download("customer_service.jpg")
There are 9 variables that provide information about the services the customer signs up for. The bar charts shows that approximately 90\% of customers subscribe to telephone service and 42\% also subscribe to multi-line service as well. It also shows that the majority of customers also subscribe to an Internet service with 44\% of customers prefer the type of fiber optic service and 34\% choose DSL. It is apparent that most customers who use the Internet service prefer not to use some additional Internet services such as OnlineSecurity, OnlineBackup, DeviceProtection and TechSupport. However, the same thing does not apply to the streaming service since the number of customers using or not using the service is similar. This also means that the streaming service is the most popular among the additional internet services.
fig,ax=plt.subplots(2,2,figsize=(10,8))
sns.histplot(data=df,x='tenure',ax=ax[0,0],kde=True)
sns.histplot(data=df,x='MonthlyCharges',ax=ax[0,1],kde=True)
sns.histplot(data=df,x='TotalCharges',ax=ax[1,0],kde=True)
ax[1,1].axis('off')
fig.text(0.12,.91,'The Histogram of Numeric Variables',fontweight='bold',ha='left',fontsize=13)
plt.savefig('numeric_univ.jpg',dpi=360,bbox_inches='tight')
files.download("numeric_univ.jpg")
For numeric variables, univariate analysis will be conducted using the histogram. In the preceding section, it is already known that there are no outliers in the dataset. However, the distribution of each variable is not yet known and needs to be visualized. The tenure histogram shows that the distribution of its values is considered bi-modal (2 peaks), indicates that the dataset is concentrated around 2 clusters. One of them is customers with less than 10 months tenure and the last is loyal customers, signed up for 65 months or more. From another point of view, one can also see that the company has indeed been able to attract many customers in the last 10 months. The bimodal distribution is also visible in the MonthlyCharges distribution where the data set is focused around two groups, i.e customers who purchased the basic service (cheapest price) with the amount of 20 dollars only and other is the customer who purchased multi-services with the amount of approximately 80 dollars per month. Apart from its positively skewed form, it is difficult to interpret any useful information from the TotalCharges histogram since TotalCharges is roughly computed from MonthlyCharges multiplied by tenure. Since univariate analysis does not provide enough information to answer the problems, bivariate analysis will be conducted in the next section with great focus to the target variable, Churn.
df_encode=df.copy() #creating the encoding dataset for the correlation matrix
df_encode=pd.get_dummies(df_encode,drop_first=True) #Full rank dummies
df_encode
| tenure | MonthlyCharges | TotalCharges | gender_Male | SeniorCitizen_Yes | Partner_Yes | Dependents_Yes | PhoneService_Yes | MultipleLines_No phone service | MultipleLines_Yes | InternetService_Fiber optic | InternetService_No | OnlineSecurity_No internet service | OnlineSecurity_Yes | OnlineBackup_No internet service | OnlineBackup_Yes | DeviceProtection_No internet service | DeviceProtection_Yes | TechSupport_No internet service | TechSupport_Yes | StreamingTV_No internet service | StreamingTV_Yes | StreamingMovies_No internet service | StreamingMovies_Yes | Contract_One year | Contract_Two year | PaperlessBilling_Yes | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Churn_Yes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 29.85 | 29.85 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 34 | 56.95 | 1889.50 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 2 | 53.85 | 108.15 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 3 | 45 | 42.30 | 1840.75 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2 | 70.70 | 151.65 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | 24 | 84.80 | 1990.50 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 7039 | 72 | 103.20 | 7362.90 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 7040 | 11 | 29.60 | 346.45 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 7041 | 4 | 74.40 | 306.60 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 7042 | 66 | 105.65 | 6844.50 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
7032 rows × 31 columns
corr=df_encode.corr(method='spearman')
corr=round(corr,2)
mask = np.triu(np.ones_like(corr, dtype=np.bool))
plt.figure(figsize=(19, 6))
heatmap = sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG',linewidths=1)
heatmap.set_title('Spearman Correlation Heatmap', fontdict={'fontsize':13}, pad=16,fontweight='bold')
plt.savefig('spearman_cor.jpg',dpi=360,bbox_inches='tight')
files.download("spearman_cor.jpg")
In this section, the analysis will begin by calculating the correlation of each variable to another with the Spearman method. Before calculating the correlation matrix, the categorical variable within the dataset first need to be encoded. As shown in the Spearman Correlation Heatmap,tenure and MonthlyCharges are highly correlated to TotalCharges, which is reasonable since TotalCharges is roughly similar to MonthlyCharges multiplied by tenure. Other interesting result is encoded variable StreamingTV_Yes has 0.53 correlation to StreamingMovies_Yes, which indicates the customer who have TV streaming service is likely to have movies streaming service also. One also can see from the heatmap that services with strongest correlation to MonthlyCharges are fiber optic with 0.8, no internet service with -0.71 and streaming service with 0.64, describing those services do have strong influences to the MonthlyCharges value. The heatmap also shows that Churn does not have quite strong correlation to any variable. The strongest correlations for the target variable Churn are -0.37 by tenure, 0.31 by fiber optic internet service, and 0.3 by electronic check.
df['Churn_encode']=df['Churn'].apply(lambda x: 1 if x=='Yes' else 0) #For the ease of bivariate analysis, new column churn_encode will be created
df
| gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | Churn_encode | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | No | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | 0 |
| 1 | Male | No | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No | 0 |
| 2 | Male | No | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 1 |
| 3 | Male | No | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No | 0 |
| 4 | Female | No | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | Male | No | Yes | Yes | 24 | Yes | Yes | DSL | Yes | No | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.50 | No | 0 |
| 7039 | Female | No | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.90 | No | 0 |
| 7040 | Female | No | Yes | Yes | 11 | No | No phone service | DSL | Yes | No | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No | 0 |
| 7041 | Male | Yes | Yes | No | 4 | Yes | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.60 | Yes | 1 |
| 7042 | Male | No | No | No | 66 | Yes | No | Fiber optic | Yes | No | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.50 | No | 0 |
7032 rows × 21 columns
# New function will be created for bivariate analysis bar chart
def group_for_stack(item_var,df=df):
df_agg=df.groupby(by=item_var)['Churn'].agg(['count']).reset_index()
df_agg.columns=[item_var,'count']
df_agg['churn']=df.groupby(by=item_var)['Churn_encode'].agg(['sum']).reset_index()['sum']
df_agg['per_churn']=df_agg['churn']/df_agg['count']
df_agg['per_not']=1-df_agg['per_churn']
return df_agg
def stackedbar_count_plot(axess,item_var,df=df):
df_agg=group_for_stack(item_var)
sns.barplot(data=df_agg,x=item_var,y='count',ax=axess,color='lightgreen')
sns.barplot(data=df_agg,x=item_var,y='churn',ax=axess,color='tomato')
for x,y,z in zip(df_agg.index,df_agg['churn'],df_agg.per_churn):
axess.annotate(str(round(z*100,1))+'%',(x,0.3*y),ha='center',weight='bold',fontsize=10)
for x,y,z in zip(df_agg.index,df_agg['churn']+0.5*(df_agg['count']-df_agg['churn']),df_agg.per_not):
axess.annotate(str(round(z*100,1))+'%',(x,y),ha='center',weight='bold',fontsize=10)
#fig.text(0.12,.9,'Count of '+item_var,fontweight='bold',ha='left',fontsize=11,color='red')
axess.set_title(item_var)
axess.set_xlabel('')
axess.set_ylabel('count')
green_patch = mpatches.Patch(color='lightgreen', label='Not churn')
red_patch = mpatches.Patch(color='tomato', label='Churn')
axess.legend(handles=[green_patch,red_patch])
return axess
demo_var=['gender','SeniorCitizen','Partner','Dependents']
fig,ax=plt.subplots(2,2,figsize=(10,7))
ax[0,0]=stackedbar_count_plot(axess=ax[0,0],item_var=demo_var[0])
ax[0,1]=stackedbar_count_plot(axess=ax[0,1],item_var=demo_var[1])
ax[1,0]=stackedbar_count_plot(axess=ax[1,0],item_var=demo_var[2])
ax[1,1]=stackedbar_count_plot(axess=ax[1,1],item_var=demo_var[3])
fig.text(0.12,.93,'Bivariate Analysis of Customer Demographic Attributes',fontweight='bold',ha='left',fontsize=13)
plt.savefig('customer_demo_bi.jpg',dpi=360,bbox_inches='tight')
files.download("customer_demo_bi.jpg")
demo_var=['Contract','PaperlessBilling','PaymentMethod']
fig,ax=plt.subplots(2,2,figsize=(10,7))
ax[0,0]=stackedbar_count_plot(axess=ax[0,0],item_var=demo_var[0])
ax[0,1]=stackedbar_count_plot(axess=ax[0,1],item_var=demo_var[1])
ax[1,0]=stackedbar_count_plot(axess=ax[1,0],item_var=demo_var[2])
for label in ax[1,0].get_xticklabels():
label.set_ha("center")
label.set_rotation(10)
ax[1,1].axis('off')
fig.text(0.12,.93,'Bivariate Analysis of Customer Account Attributes',fontweight='bold',ha='left',fontsize=13)
plt.savefig('customer_acc_bi.jpg',dpi=360,bbox_inches='tight')
files.download("customer_acc_bi.jpg")
demo_var=['PhoneService','MultipleLines','InternetService','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']
fig,ax=plt.subplots(3,3,figsize=(12,12))
k=0
for i in range(3):
for j in range(3):
ax[i,j]=stackedbar_count_plot(axess=ax[i,j],item_var=demo_var[k])
k=k+1
if j>0 :
ax[i,j].set_ylabel('')
#ax[1,1].axis('off')
fig.text(0.12,.91,'Bivariate Analysis of Customer Service Attributes',fontweight='bold',ha='left',fontsize=13)
plt.savefig('customer_service_bi.jpg',dpi=360,bbox_inches='tight')
files.download("customer_service_bi.jpg")
From the stacked bar charts of customer demographic attributes, customer account attributes, and customer service attributes one can notice that :
Both sexes, female and male have the same likelihood to churn, meaning that this variable don't give any valuable pieces of information. Same thing does not apply for Partner and Dependents since bar charts show that customer who has partner or dependents is less likely to churn. Bar chart also shows that senior customer has bigger chance to churn than younger one.
Customer Account Attributes stacked bar charts show that customer with monthly subscription has greater chance to churn than customer with yearly subscription. The bar charts also show customers with paperless billing enabled and has electronic check for payment method are more likely to churn.
From customer service attributes, it can be seen that roughly 40\% of customers who sign up for the fiber optic service choose to churn, meanwhile more than 90\% of customers who don't sign up for the internet service choose to remain. Moreover, it seems that customer who does not subscribe for additional internet services such as OnlineSecurity, OnlineBackup, DeviceProtection, and TechSupport has greater chance to churn than customer who subscribes. Same thing does not apply for both streaming services, since the churn probability of customer with or without streaming services enabled is roughly equal by 30\%. This value is considered high compared to OnlineSecurity, OnlineBackup and so on, indicating that numerous customers might less satisfied with the quality of the streaming services.
fig,ax=plt.subplots(2,2,figsize=(10,8))
sns.histplot(x='tenure',hue='Churn',data=df,multiple='stack',ax=ax[0,0],palette=['lightgreen','tomato'])
sns.histplot(x='MonthlyCharges',hue='Churn',data=df,multiple='stack',ax=ax[0,1],palette=['lightgreen','tomato'])
sns.histplot(x='TotalCharges',hue='Churn',data=df,multiple='stack',ax=ax[1,0],palette=['lightgreen','tomato'])
fig.text(0.12,.91,'Bivariate Analysis of Numerical Variables',fontweight='bold',ha='left',fontsize=13)
ax[1,1].axis('off')
plt.savefig('num_var_bi.jpg',dpi=360,bbox_inches='tight')
files.download("num_var_bi.jpg")
The bivariate analysis will also be conducted for the numerical variables such as tenure, MonthlyCharges, and TotalCharges. The tenure histogram shows that most customers who churned within the last month only subscribed for less than 5 months. This is definitely a loss as the company has actually managed to attract a lot of new customers in the last few months. From MonthlyCharges histogram, it appears that the majority of lost customers are charged by approximately 70-110 dollars per month. To urge more detail information, these 2 variables will be binned by dividing their values into 6 quantiles with roughly similar amount.
df['bin_tenure']=pd.qcut(df['tenure'],q=6) #Data binning
df['bin_MonthlyCharges']=pd.qcut(df['MonthlyCharges'],q=6)
df
| gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | Churn_encode | bin_tenure | bin_MonthlyCharges | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | No | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | 0 | (0.999, 4.0] | (21.15, 50.4] |
| 1 | Male | No | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No | 0 | (29.0, 47.0] | (50.4, 70.35] |
| 2 | Male | No | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 1 | (0.999, 4.0] | (50.4, 70.35] |
| 3 | Male | No | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No | 0 | (29.0, 47.0] | (21.15, 50.4] |
| 4 | Female | No | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes | 1 | (0.999, 4.0] | (70.35, 84.0] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | Male | No | Yes | Yes | 24 | Yes | Yes | DSL | Yes | No | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.50 | No | 0 | (14.0, 29.0] | (84.0, 96.15] |
| 7039 | Female | No | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.90 | No | 0 | (64.0, 72.0] | (96.15, 118.75] |
| 7040 | Female | No | Yes | Yes | 11 | No | No phone service | DSL | Yes | No | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No | 0 | (4.0, 14.0] | (21.15, 50.4] |
| 7041 | Male | Yes | Yes | No | 4 | Yes | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.60 | Yes | 1 | (0.999, 4.0] | (70.35, 84.0] |
| 7042 | Male | No | No | No | 66 | Yes | No | Fiber optic | Yes | No | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.50 | No | 0 | (64.0, 72.0] | (96.15, 118.75] |
7032 rows × 23 columns
fig,ax=plt.subplots(1,2,figsize=(12,5))
stackedbar_count_plot(ax[0],'bin_tenure')
stackedbar_count_plot(ax[1],'bin_MonthlyCharges')
for label in ax[0].get_xticklabels():
label.set_ha("center")
label.set_rotation(20)
for label in ax[1].get_xticklabels():
label.set_ha("center")
label.set_rotation(20)
ax[0].set_title('')
ax[1].set_title('')
ax[0].set_xlabel('tenure binning')
ax[1].set_xlabel('MonthlyCharges binning')
fig.text(0.12,.91,'Tenure and MonthlyCharges Binning',fontweight='bold',ha='left',fontsize=13)
plt.savefig('binning.jpg',dpi=360,bbox_inches='tight')
files.download("binning.jpg")
df_tenure_10=df[df['tenure']<=20]
df_agg=df_tenure_10.groupby(by='tenure')['tenure'].agg(['count']).reset_index()
df_agg['churn']=df_tenure_10[df_tenure_10.Churn=='Yes'].groupby(by='tenure')['tenure'].count().values
fig,ax=plt.subplots()
sns.barplot(data=df_agg,x='tenure',y=[100]*20,color='lightgreen',ax=ax)
sns.barplot(x=df_agg.tenure,y=df_agg.churn*100/df_agg['count'],ax=ax,color='tomato')
green_patch = mpatches.Patch(color='lightgreen', label='Not churn')
red_patch = mpatches.Patch(color='tomato', label='Churn')
ax.legend(handles=[green_patch,red_patch])
ax.set_ylabel('Percentage (%)')
ax.set_title('Percentage of customer who churn with tenure ranged from 1 - 20 months')
Text(0.5, 1.0, 'Percentage of customer who churn with tenure ranged from 1 - 20 months')
Tenure binning chart appears an instinctive result as the churn likelihood gets smaller as the membership time gets longer. It also tells that more than 50\% of customers who only subscribed less than 4 months prefer to churn (mostly even churn in their first month). From MonthlyCharges binning, it can be seen that the premium customers who are billed more than 70 dollars per month are more likely to churn compared to other customers with less bill. From the business perspective, it is surely more beneficial for the company to have a great focus improving on the premium services since those services have more lost customers and donate more month-to-month income for the company. Another interesting result is the customer who only subscribes for basic service (cheapest monthly charges) seems quite satistied with the service quality and less likely to churn.
From bivariate analysis, one can figure several categories that may represent who the lost customer is. However, the question of which factors is the main culprit remains unanswered. Consider that it is impossible for customers who use electronic check for the payment method decide to churn without any reasons. In this section, that question will be tried to solve.
p1=ggplot(df,aes(x='InternetService',y='MonthlyCharges'))+geom_jitter(aes(color='InternetService'))+facet_grid('.~Churn')+theme_bw()+labs(title='InternetService vs MonthlyCharges vs Churn',subtitle='s')
p1.draw()
plt.savefig('justifi.jpg',dpi=360,bbox_inches='tight')
files.download("justifi.jpg")
fig,ax=plt.subplots(2,2,figsize=(10,8))
sns.boxplot(data=df,x='PaperlessBilling',y='MonthlyCharges',ax=ax[0,0])
sns.boxplot(data=df,x='PaymentMethod',y='MonthlyCharges',ax=ax[0,1])
sns.boxplot(data=df,x='SeniorCitizen',y='MonthlyCharges',ax=ax[1,0])
for label in ax[0,1].get_xticklabels():
label.set_ha("center")
label.set_rotation(10)
ax[1,1].axis('off')
fig.text(0.12,.95,'Justification for PaperlessBilling, PaymentMethod, and SeniorCitizen',fontweight='bold',ha='left',fontsize=13)
fig.text(0.12,.92,'MonthlyCharges distributions for PaperlessBilling Yes, SeniorCitizen Yes, and Electronic check clustered around 60-95',ha='left',fontsize=11)
ax[0,0].set_title('PaperlessBilling')
ax[0,1].set_title('PaymentMethod')
ax[1,0].set_title('PaperlessBilling')
ax[0,0].set_xlabel('')
ax[0,1].set_xlabel('')
ax[1,0].set_xlabel('')
plt.savefig('justify_sc.jpg',dpi=360,bbox_inches='tight')
files.download("justify_sc.jpg")
Above jitter plot depicts why many customers who pay more than 70 dollars per month choose to churn. The reason is due to the internet service they bought. As shown in that plot, it is evident that customers who churn with MonthlyCharges ranging from 70 to 120 dollars are usually customers who use FiberOptic service. From the previous section, one of the most important insights is Fiber Optic service indeed has higher rate of churn. That's also why SeniorCitizen, ElectronicCheck, and PaperlessBilling have higher churn rates than others. The boxplots show that MonthlyCharges distribution for SeniorCitizen Yes, PaperlessBilling Yes and Electronic check are clustered around the churn-critical area, i.e 60-90 dollars per month.
Another key takeaways from the preceding section is that many customers who join up for the streaming service choose to churn (more than other additional services such as OnlineSecurity, OnlineBackup, etc). Because the correlation matrix indicates that customers who use TV streaming services are more likely to use movies streaming services as well, these two variables will be combined to form a new variable, Streaming.
def streaming_service(x):
if x.StreamingTV=='Yes' or x.StreamingMovies=='Yes':
return 'Yes'
elif x.StreamingTV=='No' and x.StreamingMovies=='No':
return 'No'
elif x.StreamingTV=='No internet service' and x.StreamingMovies=='No internet service':
return 'No internet service'
df['Streaming']=df.apply(streaming_service,axis=1)
p1=ggplot(df,aes(x='InternetService'))+geom_bar(aes(fill='Streaming'),position='fill',stat='count')+scale_y_continuous(labels=lambda l: [f'{v*100}%' for v in l])+labs(title='Percentage of Streaming Service')+theme_minimal()
p1
<ggplot: (8793361836913)>
Above bar chart shows that the vast majority of customers who use internet also sign up for the streaming service. It can be seen that roughly 70\% of Fiber optic users subscribe for the streaming service. However, two important insights obtained from the previous section are both of these variables contribute to the churn decision. So in which variable does the main problem exist? Does it exist in the streaming service or the kind of the internet service itself, i.e Fiber optic?
p1=ggplot(df[df['InternetService']=='Fiber optic'],aes(x='Streaming'))+geom_bar(aes(fill='Churn'),position='fill',stat='count')+scale_y_continuous(labels=lambda l: [f'{v*100}%' for v in l])+labs(title='Fiber Optic user : Streaming vs Churn')+theme_minimal()
p1
<ggplot: (8793361912317)>
p1=ggplot(df[df['InternetService']=='DSL'],aes(x='Streaming'))+geom_bar(aes(fill='Churn'),position='fill',stat='count')+scale_y_continuous(labels=lambda l: [f'{v*100}%' for v in l])+labs(title='Fiber Optic user : Streaming vs Churn')+theme_minimal()
p1
<ggplot: (8793360252701)>
These 2 figures appear as an answer to those questions. From these figures it is evident that the customers who do not subscribe for any Streaming services actually have slightly higher probability to churn in both DSL and Fiber optic services. Moreover, the Churn probability for Fiber optic users almost two times higher than DSL users. Both of these findings indicate that the major issue is the Fiber Optic service itself.
p1=ggplot(df,aes(x='tenure',y='MonthlyCharges'))+geom_point(aes(color='Contract'))+facet_grid('.~Churn')+scale_color_manual(['lightgreen','tomato','black'])+theme_bw()+labs(title='tenure vs MonthlyCharges vs Churn (Yes/No) vs Contract')
p1
<ggplot: (8793360996077)>
In addition to InternetService, tenure is another significant factor in this dataset. Above figure shows that practically almost every value of MonthlyCharges has customers who churn in their first 6 months. This suggests that the majority of new telecom customers may have negative beginning experiences. Another finding is as tenure increases the contract also increases from monthly becomes yearly. That's why in the small tenure period, mostly the contracts are monthly and customers with that contract are more likely to churn than those who have yearly contract. This finding is actually quite reasonable because monthly contract customers are frequently new and still not satisfied enough with the service quality and unsure whether they will remain loyal to the company. That's why one also can see that as the period of contract increases the churn probability decreases.
From all the results above it can be concluded that :
In the last month, the company lost 1869 its customers, representing a churn rate of approximately 27\%. The main causes of customer churn are poor quality Fiber optic services and negative first-time customer experiences. As a result, nearly 42\% of Fiber optic users chose to churn in the last month, and more than 55\% of new customers (only subscribed for less than 4 months) chose to churn as well.
Because the vast majority of customers (44\%) actually sign up for fiber optic service and pay such high monthly fees (around 70-120 dollars), the company should focus more on improving the quality in order to reduce its significant revenue loss. They should also improve the experience of their new customers and target those who are young or middle-aged, have a partner or dependents, as they are less likely to churn.